Machine Learning (ML)
Feeding data into a computer algorithm so that it learns patterns and can make predictions in new and different situations.
ML Model
Computer object implementing an ML algorithm, trained on a set of data to perform a given task.
ML is really about learning to generalize from known examples to new cases of a given task.
Neural Network (NN)
Subtype of ML model inspired by the brain. Composed of several interconnected layers of nodes that process and pass on information.
Deep Learning (DL)
Subcategory of Machine Learning that uses large NN models (i.e. with a high number of layers) to solve complex problems.
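To make "layers of nodes processing and passing information" concrete, here is a minimal sketch of a forward pass through a tiny network. The weights are hand-picked, hypothetical values (a real model would learn them during training), and the sigmoid activation is an assumption chosen for simplicity.

```python
import math

def dense_layer(inputs, weights, biases):
    """One fully connected layer: each node computes a weighted sum, then a sigmoid."""
    outputs = []
    for node_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(node_weights, inputs)) + bias
        outputs.append(1.0 / (1.0 + math.exp(-z)))  # sigmoid squashes z into (0, 1)
    return outputs

# Hypothetical weights: 2 inputs -> 3 hidden nodes -> 1 output node.
hidden_w = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
hidden_b = [0.0, 0.1, -0.1]
output_w = [[0.7, -0.5, 0.2]]
output_b = [0.05]

hidden = dense_layer([1.0, 2.0], hidden_w, hidden_b)    # first layer processes the input
prediction = dense_layer(hidden, output_w, output_b)    # second layer processes the first layer's output
print(hidden, prediction)
```

A deep model stacks many more such layers; the principle of feeding each layer's output into the next stays the same.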
The four main categories of ML are based on the type of dataset used to train the model:
| | Supervised | Unsupervised | Semi-supervised | Reinforcement |
|---|---|---|---|---|
| Input | Data | Data | Data | Environment |
| Ground-truth | Yes | No | Partial | No (reward) |
| Examples | Classification, Regression | Clustering | Anomaly detection | Game playing |
Another way to categorize ML models is based on the type of output they produce:
| Category | Description | Example Outputs | Example Use Cases |
|---|---|---|---|
| Classification | Assign one (or multiple) label(s) chosen from a given list of classes to each element of the input. | “Cat”, “Dog”, “Bird” | Spam detection, Image recognition |
| Regression | Assign one (or multiple) value(s) chosen from a continuous set of values. | 3.5, 7.2, 15.8 | Stock price prediction, Age estimation |
| Clustering | Create categories by grouping together similar inputs. | Cluster 1, Cluster 2 | Customer segmentation, Image compression |
| Anomaly Detection | Detect outliers in the dataset. | Normal, Outlier | Fraud detection, Fault detection |
| Generative Models | Generate new data similar to the training data. | Image, Text, Audio | Image generation, Text completion |
| Ranking | Arrange items in order of relevance or importance. | Rank 1, Rank 2, Rank 3 | Search engine, Recommendation system |
| Reinforcement Learning | Learn a policy to maximize long-term rewards through interaction with an environment. | Policy, Action sequence | Game playing, Robotics control |
| Dimensionality Reduction | Reduce the number of features while retaining meaningful information. | 2D or 3D projection | Visualization, Data compression |
Dataset
A collection of data used to train, validate and test ML models.
Dataset example
| | sepal length | sepal width | petal length | petal width | species |
|---|---|---|---|---|---|
| 0 | 5.7 | 3.8 | 1.7 | 0.3 | 0 |
| 1 | 6.3 | 3.3 | 6.0 | 2.5 | 2 |
| 2 | 6.9 | 3.2 | 5.7 | 2.3 | 2 |
| ... | ... | ... | ... | ... | ... |
| 147 | 4.8 | 3.4 | 1.9 | 0.2 | 0 |
| 148 | 6.4 | 2.7 | 5.3 | 1.9 | 2 |
| 149 | 4.6 | 3.4 | 1.4 | 0.3 | 0 |
150 rows × 5 columns
Instance (or sample)
An instance is one individual entry of the dataset (a row).
Feature (or attribute or variable)
A feature is a piece of information that the model uses to make predictions.
Label (or target or output or class)
A label is a piece of information that the model is trying to predict.
Feature vs. Label
Features and labels are simply different columns in the dataset with different roles.
Instances, features and labels
| | Feature 1 | Feature 2 | Feature 3 | Feature 4 | Label |
|---|---|---|---|---|---|
| Instance 0 | 5.7 | 3.8 | 1.7 | 0.3 | 0 |
| Instance 1 | 6.3 | 3.3 | 6.0 | 2.5 | 2 |
| Instance 2 | 6.9 | 3.2 | 5.7 | 2.3 | 2 |
| ... | ... | ... | ... | ... | ... |
| Instance 147 | 4.8 | 3.4 | 1.9 | 0.2 | 0 |
| Instance 148 | 6.4 | 2.7 | 5.3 | 1.9 | 2 |
| Instance 149 | 4.6 | 3.4 | 1.4 | 0.3 | 0 |
150 rows × 5 columns
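In code, the distinction between instances, features and labels often comes down to slicing. The sketch below assumes each instance is stored as a plain list whose last element is the label; the three rows are copied from the table above.

```python
# Each row is one instance: four feature values followed by the label.
rows = [
    [5.7, 3.8, 1.7, 0.3, 0],
    [6.3, 3.3, 6.0, 2.5, 2],
    [6.9, 3.2, 5.7, 2.3, 2],
]

# Features: every column except the last one.
X = [row[:-1] for row in rows]
# Labels: the last column only.
y = [row[-1] for row in rows]

print(X[0])  # [5.7, 3.8, 1.7, 0.3]
print(y)     # [0, 2, 2]
```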
Dataset subsets
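An ML dataset is usually split into three disjoint subsets, each with a distinct role in the training process: a training set used to fit the model, a validation set used to tune and compare models during development, and a test set used to estimate final performance on unseen data.
Study metaphor: the training set is the exercises, the validation set is the past years' exams, and the test set is the real exam.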
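A minimal sketch of such a split is shown below. The 70/15/15 proportions, the function name `split_dataset` and the fixed seed are illustrative assumptions, not values prescribed by this document.

```python
import random

def split_dataset(instances, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle instances and cut them into disjoint train/validation/test lists."""
    shuffled = list(instances)              # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)   # shuffle to avoid ordering effects
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]       # whatever remains goes to the test set
    return train, val, test

# Split the indices of a 150-instance dataset.
train, val, test = split_dataset(range(150))
print(len(train), len(val), len(test))
```

Because the three lists are slices of one shuffled copy, no instance can appear in more than one subset.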
Decision Tree
A tree-like structure used for both classification and regression.
Random Forest
An ensemble method that combines multiple decision trees.
Support Vector Machine (SVM)
Used for classification and regression, effective in high-dimensional spaces.
The kernel trick is a way to map the features into a higher-dimensional space implicitly: only inner products between mapped points are computed, never the new features themselves.
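As a small worked example of the kernel trick, the sketch below uses the homogeneous polynomial kernel of degree 2 on 2D vectors (a choice made here for illustration). It compares the inner product computed after an explicit mapping to 3D with the kernel value computed directly in the original 2D space.

```python
import math

def dot(a, b):
    """Inner product of two vectors of equal length."""
    return sum(ai * bi for ai, bi in zip(a, b))

def explicit_map(v):
    """Degree-2 feature map for a 2D vector: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = v
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(a, b):
    """Polynomial kernel of degree 2, computed in the original 2D space."""
    return dot(a, b) ** 2

x, z = (1.0, 2.0), (3.0, 4.0)
via_map = dot(explicit_map(x), explicit_map(z))  # builds the 3D features first
via_kernel = poly_kernel(x, z)                   # never builds the 3D features
print(via_map, via_kernel)                       # both equal 121 up to rounding
```

The two values agree, but the kernel version needs only one 2D inner product. For kernels whose implicit space is very large, this saving is what makes the approach practical.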
Other methods for supervised learning include:
| Method | Description |
|---|---|
| Linear Regression | Predicts a continuous value with a linear model |
| Logistic Regression | Predicts a binary class (through a probability) with a linear model |
| K-Nearest Neighbors (KNN) | Non-parametric method for classification and regression |
| Boosting | Ensemble method (like Random Forests) that combines weak learners to form a strong model |
| Naive Bayes | Probabilistic classifier based on Bayes’ theorem |
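To illustrate one of the methods in the table, here is a minimal sketch of K-Nearest Neighbors classification. It assumes Euclidean distance and a simple majority vote; the six training rows are taken from the dataset example, and the query point is a made-up value.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query.
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    nearest_labels = [train_y[i] for i in order[:k]]
    # Return the most common label among the k nearest neighbours.
    return Counter(nearest_labels).most_common(1)[0][0]

# Features (sepal length, sepal width, petal length, petal width) and species labels.
train_X = [[5.7, 3.8, 1.7, 0.3], [6.3, 3.3, 6.0, 2.5], [6.9, 3.2, 5.7, 2.3],
           [4.8, 3.4, 1.9, 0.2], [6.4, 2.7, 5.3, 1.9], [4.6, 3.4, 1.4, 0.3]]
train_y = [0, 2, 2, 0, 2, 0]

# Hypothetical new flower with short petals, close to the class-0 rows.
print(knn_predict(train_X, train_y, [5.0, 3.5, 1.5, 0.2]))  # 0
```

KNN is called non-parametric because it stores the training data instead of fitting a fixed set of parameters; all the work happens at prediction time.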
K-Means
A method for partitioning data into \(k\) clusters.
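Partitioning data into \(k\) clusters can be sketched with Lloyd's algorithm, the usual iteration behind K-Means. This is a simplified version under stated assumptions: a fixed number of iterations instead of a convergence check, random data points as initial centroids, and toy 2D points invented for the example.

```python
import math
import random

def kmeans(points, k, n_iter=20, seed=0):
    """Lloyd's algorithm: alternate between assigning points and moving centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)        # start from k distinct data points
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for c, members in enumerate(clusters):
            if members:                      # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, clusters

# Two well-separated toy groups of 2D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))
```

On data this well separated the result does not depend on the initial centroids. On real data it can, which is why K-Means is commonly run several times with different initializations.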
Hierarchical Clustering
Builds a hierarchy of clusters using either agglomerative or divisive methods.
DBSCAN
Clustering based on the density of data points.
Other methods for unsupervised learning include:
| Method | Description |
|---|---|
| K-Means | Partition data into \(k\) clusters |
| Hierarchical Clustering | Build a hierarchy of clusters |
| DBSCAN | Density-based clustering that groups points closely packed together |
| Gaussian Mixture Models (GMM) | Probabilistic clustering assuming data is generated from multiple Gaussian distributions |
| Principal Component Analysis (PCA) | Reduce dimensionality by finding principal components that explain variance |
| t-SNE | Nonlinear dimensionality reduction for visualizing high-dimensional data |
| Autoencoders | Neural networks that learn efficient representations of data in an unsupervised manner |
| Self-Organizing Maps (SOM) | Neural network-based method for clustering and visualization |
Principal Component Analysis (PCA)
Dimensionality reduction technique that projects data into lower dimensions.
t-SNE
A nonlinear dimensionality reduction technique primarily used for visualization of high-dimensional data.
Gather the data, potentially from multiple different sources. Choosing the right sources can also depend on the choices made in the next steps.
Multiple sources of issues and steps to perform:
Idea
A priori, all features are equally important, so none of them should dominate simply because of its scale. Features with much larger values than the others would otherwise have a disproportionate influence on the model.
Usually, each feature is normalized individually over the whole dataset to obtain a distribution with a mean of 0 and a standard deviation of 1:
\[ \begin{align*} \hat{X} & = \frac{1}{n+1} \sum\limits_{j=0}^n X_j \\ \sigma_X & = \sqrt{\frac{1}{n+1} \sum\limits_{j=0}^n (X_j - \hat{X})^2} \\ \forall k \in \{0, \cdots, n\}, \quad \tilde{X}_k & = \frac{X_k - \hat{X}}{\sigma_X} \end{align*} \]
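The normalization of a single feature to mean 0 and standard deviation 1 can be sketched as follows. The function name `standardize` is hypothetical, the population standard deviation (dividing by the number of values) is assumed, and the sample values are the sepal lengths of the six rows shown in the dataset example.

```python
import math

def standardize(values):
    """Rescale one feature column to mean 0 and standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation: divide by n, not n - 1.
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    # Note: a constant column has std == 0 and would need special handling.
    return [(v - mean) / std for v in values]

sepal_length = [5.7, 6.3, 6.9, 4.8, 6.4, 4.6]
scaled = standardize(sepal_length)
print(scaled)
```

In practice the mean and standard deviation are often computed on the training subset only and then reused for the validation and test subsets, so that no information leaks from the held-out data.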
Criteria selection among the many possible ones:
Cross-validation
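Method to estimate the real performance of the model by repeatedly training it on one part of the data and evaluating it on the remaining held-out part.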
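One common variant is k-fold cross-validation, sketched below under stated assumptions: the data is cut into k contiguous folds without shuffling, and `evaluate` is a hypothetical callback that would train a model on the training indices and return its score on the validation indices.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

def cross_validate(n_samples, k, evaluate):
    """Average the score returned by evaluate(train_idx, val_idx) over the k folds."""
    scores = [evaluate(tr, va) for tr, va in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    print(train_idx, val_idx)

# Placeholder callback standing in for real training and scoring.
mean_score = cross_validate(10, 3, lambda tr, va: len(va) / (len(tr) + len(va)))
```

Every instance is used for validation exactly once, so the averaged score depends less on one lucky or unlucky split than a single hold-out estimate does.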
Once the data is preprocessed, the model selected, and the hyperparameters chosen and optimized, the final model can be trained several times and the best run kept.
Quality of the data is obviously crucial to train well-performing models. Quality encompasses multiple aspects:
Diversity is the most important aspect of a dataset because ML models generalize well within the range of situations covered by their training data but perform poorly in genuinely new scenarios. There are different aspects to diversity to keep in mind:
Biased
Describes a model that systematically makes the same kind of wrong predictions in similar cases.
In practice, a model trained on biased data will usually reproduce that bias. This can have major consequences and shouldn’t be underestimated: even a cold-hearted ML algorithm is not objective if it wasn’t trained on objectively chosen and annotated data.
However, some model architectures, training procedures and evaluation methods can prevent or detect biases, which sometimes makes it possible to build unbiased models from biased data. But this needs to be well thought out and won’t happen unless
Underfitting
When a model is too simple to properly extract information from a complex task. Can also be explained by key information missing in the input features.
Overfitting
When a model is too complex to properly generalize to new data. Often happens when an NN is trained for too long on a dataset that is not diverse enough and ends up learning the noise in the data.
| Solution | Underfitting | Overfitting |
|---|---|---|
| Complexity | Increase | Reduce |
| Number of features | Increase | Reduce |
| Regularization | Reduce | Increase |
| Training time | Increase | Reduce |
General strategies:
Interpretable
Describes an ML model whose decision-making process is straightforward and transparent, making it directly understandable by humans. This requires restricting the model's complexity.
Explainable
Describes an ML model whose decision-making process can be partly interpreted afterwards using post hoc interpretation techniques. These techniques are often used on models that are too complex to be interpreted directly.